Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(pkg/repository/maintenance): handle when there's no container status #8271

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mcluseau
Copy link

@mcluseau mcluseau commented Oct 6, 2024

Fixes a crash in this case.

panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0

goroutine 562 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x2ae2cc0?, 0xc001a084b0?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/vmware-tanzu/velero/pkg/repository.GetMaintenanceResultFromJob({0x318cd00, 0xc0009be3f0}, 0xc0025ae508)
	/go/src/github.com/vmware-tanzu/velero/pkg/repository/maintenance.go:120 +0x229

@mcluseau
Copy link
Author

mcluseau commented Oct 7, 2024

Hi @mateusoliveira43 , sure, I've updated the comments and the error

@mcluseau
Copy link
Author

mcluseau commented Oct 7, 2024

if you're like me, you'd notice rust would have made the problem much more obvious in the first place 😅

@mateusoliveira43
Copy link
Contributor

@mcluseau yeah, Go errors are almost unreadable, thanks for taking time to fix this!

Copy link

codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 59.21%. Comparing base (42de654) to head (1e3e027).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
pkg/repository/maintenance.go 50.00% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8271      +/-   ##
==========================================
- Coverage   59.21%   59.21%   -0.01%     
==========================================
  Files         367      367              
  Lines       30841    30849       +8     
==========================================
+ Hits        18262    18266       +4     
- Misses      11119    11121       +2     
- Partials     1460     1462       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kaovilai
Copy link
Contributor

kaovilai commented Oct 7, 2024

@mcluseau please add changelog

@kaovilai
Copy link
Contributor

kaovilai commented Oct 7, 2024

Add changelog here
https://github.com/vmware-tanzu/velero/tree/main/changelogs/unreleased
File name: <pr-num>-<gh-username>

Copy link
Contributor

@kaovilai kaovilai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO. but not blocked on it.

changelogs/unreleased/8271-mcluseau Outdated Show resolved Hide resolved

statuses := pod.Status.ContainerStatuses
if len(statuses) == 0 {
return "", fmt.Errorf("no container statuses found for job %s", job.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The termination-log is not always written, it is only written when maintenance fails.
So I suspect it is a non-error case when len(statuses) == 0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me, "no statuses" is not the expected state even if the container terminated successfully:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-state-terminated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I may have pointed the comment to wrong code. The most suspecting line is below:

	terminated := statuses[0].State.Terminated
	if terminated == nil {
		return "", fmt.Errorf("container for job %s is not terminated", job.Name)

As you can see from the maintenance code, if the maintenance succeeds, no termination message is written, so statuses[0].State.Terminated may indeed be nil. But this is a normal case.

Furthermore, when nothing is set to pod.Status.ContainerStatuses , will it be nil? If so, it is still a normal case.

Please double confirm this.

Copy link
Author

@mcluseau mcluseau Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I answered to the State.Terminated case too :) there's not just the termination log, there's also the exit code, time, etc.

About container statuses, they are expected to be filled (from waiting to running to terminated) but they may not be present when the pod was just created (I guess a kind of race condition leads the initial but mentioned here). So, my interpretation is that when it's empty, it's not a normal state and can't be after the container's termination: ie, the caller made the call too soon for some reason, very in line with semantics of asking before the pod was created by the job, which is a case that returns an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants